## [1] 1599 14
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "rating"
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## $ rating : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality rating
## Min. : 8.40 3: 10 bad : 63
## 1st Qu.: 9.50 4: 53 average:1319
## Median :10.20 5:681 good : 217
## Mean :10.42 6:638
## 3rd Qu.:11.10 7:199
## Max. :14.90 8: 18
The high concentration of wines in the center region and the lack of outliers might be a problem for generating a predicting model later on.
There is a high concentration of wines with fixed.acidity close to 8 (the median is 7.9), but there are also some outliers that pull the mean up to 8.32.
The distribution appears bimodal at 0.4 and 0.6 with some outliers in the higher ranges.
Now this is a strange distribution. About 8% of wines do not contain citric acid at all. Maybe a problem in the data collection process?
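The 8% figure can be sanity-checked from numbers that appear elsewhere in this report. This is only a rough sketch: it assumes that the 132 rows reported as non-finite once `log10(citric.acid)` is taken are exactly the wines with zero citric acid, and the variable names are mine.

```r
# Counts taken from this report's own output:
n_total <- 1599  # rows reported by dim()
n_zero  <- 132   # rows dropped as non-finite after log10(citric.acid)

# Share of wines with no citric acid (assumes every dropped row is a zero)
share_zero <- n_zero / n_total
round(100 * share_zero, 1)

# On the real data frame the same share would be mean(wine$citric.acid == 0)
```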
A high concentration of wines around 2.2 (the median) with some outliers along the higher ranges.
We see a similar distribution with chlorides.
The distribution peaks at around 7 and from then on resembles a long-tailed distribution, with very few wines over 60.
As expected, this distribution resembles closely the last one.
The distribution for density has a very normal appearance.
pH also looks normally distributed.
For sulphates we see a distribution similar to the ones of residual.sugar and chlorides.
We see the same rapid increase and then long tailed distribution as we saw in sulfur.dioxide. I wonder if there is a correlation between the variables.
There are 1599 observations of wines in the dataset, with 12 features. One variable (quality) is categorical; the others are numerical variables that describe physical and chemical properties of the wine.
Other observations: The median quality is 6, which in the given scale (1-10) is a mediocre wine. The best wine in the sample has a score of 8, and the worst has a score of 3. The dataset is not balanced, that is, there are many more average wines than poor or excellent ones, and this might prove challenging when designing a predictive algorithm.
The main feature in the data is quality. I’d like to determine which features determine the quality of wines.
The variables related to acidity (fixed, volatile, citric.acid and pH) might explain some of the variance. I suspect the different acid concentrations might alter the taste of the wine. Also, residual.sugar dictates how sweet a wine is and might also have an influence in taste.
I created a rating variable to improve the later visualizations.
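The code that built `rating` is not shown in this output, so the following is only a sketch of one way it could be done. The cut points (scores up to 4 are "bad", 5 and 6 are "average", 7 and above are "good") are my assumption; they are chosen because they reproduce the counts printed by `summary()` above.

```r
# Quality counts as printed by summary(); names are the quality scores
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
scores <- as.numeric(names(quality_counts))

# Assumed cut points: (0,4] bad, (4,6] average, (6,10] good
rating <- cut(scores,
              breaks = c(0, 4, 6, 10),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

# Total number of wines that falls into each rating level
rating_counts <- sapply(split(quality_counts, rating), sum)
rating_counts
```

On the full data frame the same call would take the numeric quality scores as input, for example `cut(as.numeric(as.character(wine$quality)), ...)`.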
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Citric.acid stood out from the other distributions. Apart from some outliers, it had a rectangular-looking distribution, which seems very unexpected given the wine quality distribution.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
Alcohol has a negative correlation with density. This is expected as alcohol is less dense than water.
Volatile.acidity has a positive correlation with pH. This is unexpected as pH is a direct measure of acidity. Maybe the effect of a lurking variable?
Residual.sugar does not show correlation with quality. Free.sulfur.dioxide and total.sulfur.dioxide are highly correlated as expected.
Density has a strong correlation with fixed.acidity (about 0.67). The variables that have the strongest correlations to quality are alcohol (about 0.48) and volatile.acidity (about -0.39).
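To double-check which variables matter most for quality, the quality column of the correlation table can be ranked by absolute value. This is a small sketch that copies the printed coefficients by hand; the vector name `r_quality` is mine.

```r
# Correlations with quality, copied from the table above
r_quality <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Rank by strength of association, ignoring the sign
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]
round(ranked, 2)
```

Sorting by absolute value keeps negative relationships such as volatile.acidity from being buried at the bottom of the list.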
As the correlation table showed, fixed.acidity seems to have little to no effect on quality.
volatile.acidity seems to be an unwanted feature in wines. Quality seems to go up when volatile.acidity goes down. The higher ranges seem to produce more average and poor wines.
We can see the soft correlation between these two variables. Better wines tend to have higher concentration of citric acid.
Contrary to what I initially expected, residual.sugar seems to have little to no effect on perceived quality.
Although weakly correlated, a lower concentration of chlorides seems to produce better wines.
The ranges are really close to each other but it seems too little sulfur dioxide and we get a poor wine, too much and we get an average wine.
Since total.sulfur.dioxide is a superset of free.sulfur.dioxide, it is no surprise to find a very similar distribution here.
Better wines tend to have lower densities, but this is probably due to the alcohol concentration. I wonder if density still has an effect if we hold alcohol constant.
Although there is definitely a trend (better wines being more acidic), there are some outliers. I wonder how the distribution of the different acids affects this.
It is really strange that an acid concentration would have a positive correlation with pH. Maybe Simpson's paradox?
Although it is not clear what each cluster means, it seems Simpson's paradox is in fact present.
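For readers who have not met Simpson's paradox before, here is a minimal sketch of the effect. The numbers are invented for illustration and are not wine data; the groups "A" and "B" stand in for whatever the clusters represent.

```r
# Toy data (hypothetical): within each group y falls as x rises,
# but group B sits higher on both axes than group A.
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x     = 1:10,
  y     = c(9, 8, 7, 6, 5, 14, 13, 12, 11, 10)
)

# Correlation when the groups are pooled together
pooled <- cor(toy$x, toy$y)

# Correlation inside each group separately
within <- sapply(split(toy, toy$group), function(d) cor(d$x, d$y))

pooled  # positive
within  # negative in both groups
```

The pooled correlation has the opposite sign to the within-group ones, which is the pattern suspected for volatile.acidity and pH.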
Because we know pH measures acid concentration on a log scale, it is no surprise to find stronger correlations between pH and the log of the acid concentrations. We can investigate how much of the variance in pH these three acidity variables can explain using a linear model.
##
## Call:
## lm(formula = pH ~ I(log10(citric.acid)) + I(log10(volatile.acidity)) +
## I(log10(fixed.acidity)), data = subset(wine, citric.acid >
## 0))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47184 -0.06318 -0.00003 0.06447 0.32265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.230862 0.040578 104.266 < 2e-16 ***
## I(log10(citric.acid)) -0.052187 0.008797 -5.933 3.72e-09 ***
## I(log10(volatile.acidity)) -0.049788 0.021248 -2.343 0.0193 *
## I(log10(fixed.acidity)) -1.071983 0.038987 -27.496 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1068 on 1463 degrees of freedom
## Multiple R-squared: 0.4876, Adjusted R-squared: 0.4866
## F-statistic: 464.1 on 3 and 1463 DF, p-value: < 2.2e-16
## Warning in loop_apply(n, do.ply): Removed 132 rows containing non-finite
## values (stat_boxplot).
It seems the three acidity variables can only explain about half the variance in pH. The mean error is especially bad on poor and on excellent wines. This leads me to believe that there are other components that affect acidity.
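To illustrate why the acids enter the model on a log10 scale, here is a noise-free toy sketch. It assumes pH depends only on log10 of fixed acidity, with an intercept and slope loosely borrowed (rounded) from the fit above, which is of course a simplification of the real data.

```r
# Fixed acidity values spanning the observed range (4.6 to 15.9)
fixed_acidity <- seq(4.6, 15.9, length.out = 50)

# Hypothetical pH that is exactly linear in log10(acidity)
pH_toy <- 4.23 - 1.07 * log10(fixed_acidity)

# Pearson correlation on the raw scale versus the log scale
r_raw <- cor(pH_toy, fixed_acidity)
r_log <- cor(pH_toy, log10(fixed_acidity))

c(raw = r_raw, log = r_log)
```

On the log scale the toy relationship is perfectly linear, while on the raw scale the curvature weakens the correlation, which matches the reasoning behind the `I(log10(...))` terms.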
Interesting. Although there are many outliers in the medium wines, better wines seem to have a higher concentration of sulphates.
The correlation is clear here. With an increase in alcohol content we see an increase in the concentration of better graded wines. Given the high number of outliers, it seems we cannot rely on alcohol alone to produce better wines. Let's try using a simple linear model to investigate.
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.12503 0.17471 -0.716 0.474
## alcohol 0.36084 0.01668 21.639 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
Based on the R-squared value, it seems alcohol alone only explains about 23% of the variance in quality. We're going to need to look at the other variables to generate a better model.
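Two details of this model are worth a quick sketch. First, `as.numeric()` on an ordered factor returns the level positions (1 to 6 here), not the scores 3 to 8, which shifts the intercept but not the slope or R-squared. Second, for a one-predictor model R-squared is just the squared correlation. The example uses the first ten rows printed by `str()` above; the variable names are mine.

```r
# First ten rows as printed by str() earlier in this report
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), levels = 3:8, ordered = TRUE)

codes  <- as.numeric(quality)                # level positions 1..6
scores <- as.numeric(as.character(quality))  # original scores 3..8

fit_codes  <- lm(codes ~ alcohol)
fit_scores <- lm(scores ~ alcohol)

# Same slope; the intercept differs by the constant offset of 2
slope_codes     <- coef(fit_codes)[["alcohol"]]
slope_scores    <- coef(fit_scores)[["alcohol"]]
intercept_shift <- coef(fit_scores)[["(Intercept)"]] - coef(fit_codes)[["(Intercept)"]]

# R-squared of a simple regression equals the squared Pearson correlation
r2_fit <- summary(fit_codes)$r.squared
r2_cor <- cor(codes, alcohol)^2

# The same identity applied to the correlation reported in the table above
r2_reported <- 0.47616632^2
```

Squaring the reported correlation of 0.476 gives roughly 0.227, in line with the Multiple R-squared printed by `summary()`.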
How did the feature(s) of interest vary with other features in the dataset?
Fixed.acidity seems to have little to no effect on quality.
Quality seems to go up when volatile.acidity goes down. The higher ranges seem to produce more average and poor wines.
Better wines tend to have higher concentration of citric acid.
Contrary to what I initially expected, residual.sugar seems to have little to no effect on perceived quality.
Although weakly correlated, a lower concentration of chlorides seems to produce better wines.
Better wines tend to have lower densities.
In terms of pH it seems better wines are more acid but there were many outliers. Better wines also seem to have a higher concentration of sulphates.
Alcohol content has the strongest correlation with quality, but as the linear model showed us, it cannot explain all the variance alone. We're going to need to look at the other variables to generate a better model.
I verified the strong relation between free and total sulfur.dioxide.
I also checked the relation between the acid concentrations and pH. Of those, only volatile.acidity surprised me, showing a positive correlation with pH.
The relationship between total.sulfur.dioxide and free.sulfur.dioxide stood out as one of the strongest (correlation of about 0.67).
When we hold alcohol constant, there is no evidence that density affects quality which confirms our earlier suspicion.
## Warning in loop_apply(n, do.ply): Removed 8 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 1 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 7 rows containing missing values
## (geom_point).
Interesting! It seems that for wines with high alcohol content, having a higher concentration of sulphates produces better wines.
The reverse seems to be true for volatile acidity. Having less acetic acid at higher alcohol concentrations seems to produce better wines.
Low pH and high alcohol concentration seem to be a good match.
Almost no variance in the y axis compared to the x axis. Let's try the other acids.
High citric acid and low acetic acid seems like a good combination.
##
## Pearson's product-moment correlation
##
## data: wine$citric.acid and wine$fixed.acidity
## t = 36.2341, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
Although there seems to be a correlation between tartaric acid and citric acid concentrations, nothing stands out in terms of quality.
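As a cross-check of the test output above, the t statistic and the confidence interval can be recomputed by hand from the reported correlation and degrees of freedom. This sketch uses the standard formulas for Pearson's test (t from r, and a Fisher z interval); the variable names are mine.

```r
r  <- 0.6717034  # sample correlation reported above
df <- 1597       # degrees of freedom reported above (n - 2)
n  <- df + 2

# t statistic for H0: true correlation equals 0
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z transformation
z        <- atanh(r)
half     <- qnorm(0.975) / sqrt(n - 3)
ci_lower <- tanh(z - half)
ci_upper <- tanh(z + half)

c(t = t_stat, lower = ci_lower, upper = ci_upper)
```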
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity,
## data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid + fixed.acidity, data = training_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH,
## data = training_data)
##
## =============================================================================
## m1 m2 m3 m4 m5 m6
## -----------------------------------------------------------------------------
## (Intercept) -0.066 -0.604** 0.605* 0.670** 0.294 1.328*
## (0.220) (0.224) (0.248) (0.257) (0.289) (0.516)
## alcohol 0.357*** 0.339*** 0.306*** 0.305*** 0.315*** 0.362***
## (0.021) (0.020) (0.020) (0.020) (0.020) (0.021)
## sulphates 1.099*** 0.745*** 0.770*** 0.780*** 0.980***
## (0.138) (0.137) (0.139) (0.138) (0.139)
## volatile.acidity -1.199*** -1.272*** -1.333***
## (0.125) (0.146) (0.147)
## citric.acid -0.128 -0.436*
## (0.130) (0.170)
## fixed.acidity 0.047**
## (0.017)
## pH -0.631***
## (0.152)
## -----------------------------------------------------------------------------
## R-squared 0.232 0.280 0.343 0.344 0.349 0.293
## adj. R-squared 0.231 0.279 0.341 0.341 0.346 0.291
## sigma 0.704 0.682 0.651 0.651 0.649 0.676
## F 289.048 185.949 166.182 124.873 102.212 131.779
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1022.548 -991.540 -947.687 -947.203 -943.227 -983.004
## Deviance 473.685 444.023 405.216 404.808 401.465 436.188
## AIC 2051.096 1991.080 1905.374 1906.407 1900.454 1976.008
## BIC 2065.693 2010.544 1929.704 1935.602 1934.516 2000.337
## N 959 959 959 959 959 959
## =============================================================================
I did not include pH in the same formula as the acids to avoid collinearity problems.
High alcohol contents and high sulphate concentrations combined seem to produce better wines.
Yes, I created several models. The most prominent of them was composed of the variables alcohol, sulphates, and the acid variables. There are two problems with it. First, the low R-squared score suggests that there is missing information to properly predict quality. Second, both the residuals plot and the cross-validation favor average wines. This is probably a reflection of the high number of average wines in the training dataset, or it could mean that there is missing information that would help predict the edge cases. I hope that the next course in the nanodegree will help me generate better models :).
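The split that produced `training_data` is not shown in this output. The models report N = 959, which is about 60% of the 1599 wines, so a split along these lines would be consistent with it. This is only a sketch under that assumption; the seed and the index names are arbitrary choices of mine.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

n <- 1599                                   # rows in the dataset
train_idx <- sample(n, size = floor(0.6 * n))  # assumed 60% training share
test_idx  <- setdiff(seq_len(n), train_idx)    # remaining rows for testing

length(train_idx)
length(test_idx)

# With the data frame loaded, the subsets would then be
# wine[train_idx, ] and wine[test_idx, ]
```

Because poor and excellent wines are rare, a purely random split like this can leave very few of them in either subset, which may be part of why the models favor average wines.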
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
This is a very strange distribution. It does not match what we would expect from a variable collected in an experimental situation.
## Warning in loop_apply(n, do.ply): Removed 8 rows containing missing values
## (geom_point).
High alcohol contents and high sulphate concentrations combined seem to produce better wines.
The linear model with the highest R-squared value could only explain around 35% of the variance in quality. Also, the clear correlation shown by the residual plot earlier seems to reinforce that there is missing information to better predict both poor and excellent wines.